Support Vector Machine - Capstone Project

Author

Gopi Shankar Reddy Mallu, Kavya Reddy Maale, Satya Nageswara Dinesh Donkada, Vamsi Krishna Kalla

Published

February 27, 2024

Introduction

Support Vector Machines (SVMs) are a type of supervised learning algorithm that can be used for classification or regression tasks. The main idea behind SVMs is to find a hyperplane that maximally separates the different classes in the training data. This is done by finding the hyperplane that has the largest margin, which is defined as the distance between the hyperplane and the closest data points from each class. Once the hyperplane is determined, new data can be classified by determining on which side of the hyperplane it falls. SVMs are particularly useful when the data has many features, and/or when there is a clear margin of separation in the data.

Hyperplanes are decision boundaries that help classify the data points. Data points falling on either side of the hyperplane can be attributed to different classes. Also, the dimension of the hyperplane depends upon the number of features. If the number of input features is 2, then the hyperplane is just a line. If the number of input features is 3, then the hyperplane becomes a two-dimensional plane. It becomes difficult to imagine when the number of features exceeds 3.

Literature Review

Support Vector Machines (SVM) have emerged as a potent tool in the realm of supervised learning, offering a robust mathematical framework for both classification and regression tasks. With a foundation rooted in principles such as structural risk minimization and kernel functions (Jakkula, 2006), SVM has demonstrated exceptional generalization capabilities, adeptly handling non-linear decision boundaries through kernel tricks (Jun, 2021; Deris, 2011). Despite challenges like computational cost and scalability (Bhavsar, 2012), the evolution of SVM has led to significant contributions in diverse fields, including pattern recognition, computer vision (Kecman, 2005), and even agriculture, where it aids in optimizing crop yield and disease identification (Kumar et al., 2017). The ongoing advancements in SVM research are geared towards refining algorithms and broadening their application spectrum, especially in the context of burgeoning data volumes (Yue, 2003).

In the financial and healthcare sectors, SVM has proven its efficacy in various applications. It has been utilized to construct reliable stock market prediction models by analyzing financial indices like Earnings Per Share (EPS) and Net Profit Growth Rate (NPGR) (Han, 2007). In healthcare, SVM has been instrumental in developing advanced diagnostic tools, such as the optimized SVM model for early dementia prediction (Javeed et al., 2023) and the multi-disease prediction model using an improved SVM-radial bias kernel approach (Harimoorthy & Thangavelu, 2021). These innovations underscore the potential of machine learning in revolutionizing healthcare by facilitating early diagnosis and personalized treatment plans.

SVM’s application extends to domains like online retail and network security, where it addresses complex challenges efficiently. In online marketplaces, SVM combined with Particle Swarm Optimization has enhanced the accuracy of text classification for customer reviews (Sahara et al., 2023), providing valuable insights for sellers. In network security, approaches such as combining SVM with naïve Bayes feature embedding have been proposed for intrusion detection, achieving high accuracy rates in identifying network threats (Gu et al., 2021). Moreover, hybrid methods for attack detection, which integrate SVM with evolutionary algorithms and artificial neural networks, have shown promise in reducing dimensionality and training time while maintaining high detection accuracy (Hosseini & Zade, 2020).

Machine learning techniques, particularly SVM, are revolutionizing various fields by addressing complex challenges with precision and efficiency. In healthcare, SVM has been applied to electronic health records for cancer classification, achieving high accuracy rates in identifying different types of malignancies (K. Ghanem et al, 2021). Furthermore, SVM’s versatility is evident in its application across domains such as finance, where it has been used to assess credit risk for small and medium enterprises in supply chain finance (Zhang, Hu, & Zhang, 2015), and in cloud-based services, where it ensures data confidentiality and decision verifiability in health monitoring systems (Liang et al., 2021). These advancements highlight the transformative potential of machine learning techniques in enhancing diagnostic accuracy, optimizing financial assessments, and ensuring secure cloud-based services.

Dataset

Customer retention is a critical aspect for banks to ensure the sustainability of their operations. ABC Multinational Bank, in particular, places a strong emphasis on retaining its account holders. The primary objective of this analysis is to examine the customer data of the bank’s account holders to predict and prevent customer churn effectively.

The dataset under consideration contains information about account holders at ABC Multinational Bank, with the ultimate goal of predicting customer churn. The dataset comprises the following columns:

Column Name        Description
customer_id        A unique identifier for each customer, not used in the analysis.
credit_score       A numerical representation of the customer’s creditworthiness.
country            The country in which the customer resides.
gender             The gender of the customer (e.g., male, female).
age                The age of the customer in years.
tenure             The number of years the customer has been with the bank.
balance            The current balance in the customer’s account.
products_number    The number of products the customer has with the bank.
credit_card        Indicates whether the customer has a credit card with the bank.
active_member      Indicates whether the customer is an active member.
estimated_salary   The estimated annual salary of the customer.
churn              The target variable, indicating customer churn (1 for churned, 0 for not churned).

Methodology

Mathematical Intuition of Support Vector Machine

Consider a binary classification task where there are two classes, denoted by the labels +1 and -1. The input feature vectors (X) and the matching class labels (Y) comprise our training dataset.

The equation of the hyperplane can be written as:

\(w^Tx+b=0\)

The vector w is the normal vector to the hyperplane, i.e. the direction perpendicular to it. The parameter b is the offset: the hyperplane lies at a signed distance of \(-b/\|w\|\) from the origin along w.

The signed distance of a data point \(x_i\) from the hyperplane is:

\(d_i = \frac{w^Tx_i + b}{\|w\|}\)

where \(\|w\|\) is the Euclidean norm of the normal vector w.

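A new point x is classified by the sign of \(w^Tx+b\), using the class labels +1 and -1 defined above:

\(\hat{y} =\begin{cases} +1 & \text{if } w^T x + b \geq 0 \\ -1 & \text{if } w^T x + b < 0 \end{cases}\)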
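As a minimal sketch of the quantities above, the following R snippet computes \(w^Tx+b\), the signed distance, and the predicted label for a few points, using the +1/-1 labels from the start of this section. The values of w, b, and the points are hypothetical and chosen only for illustration; they are not fitted to the bank data.

```r
# Hypothetical hyperplane parameters, chosen only for illustration
w <- c(2, -1)   # normal vector
b <- -1         # offset

# Three toy points, one per row
X <- rbind(c(2, 1), c(0, 1), c(1, 1))

# Raw decision values w^T x + b for each row
scores <- as.vector(X %*% w + b)

# Signed distance of each point from the hyperplane
distances <- scores / sqrt(sum(w^2))

# Predicted label: +1 on or above the hyperplane, -1 below it
labels <- ifelse(scores >= 0, 1, -1)

print(data.frame(score = scores, distance = round(distances, 3), label = labels))
```

The third point has a score of exactly 0, so it lies on the hyperplane itself and is assigned to the +1 class by convention.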

Kernel Function in SVM

An SVM kernel is a function that implicitly maps the low-dimensional input space into a higher-dimensional feature space, so that a problem that is not linearly separable in the original space can become separable in the new one. It is most useful for non-linear separation problems. Simply put, the kernel applies a (possibly complex) transformation to the data, and the SVM then finds a separating boundary in the transformed space based on the defined labels.

\(\text{Linear}: K(x_i, x_j) = x_i^T x_j\)
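To make the idea of mapping into a higher-dimensional space concrete, here is a small sketch on toy data. The data, the feature map phi(x) = (x, x^2), and the threshold 2.5 are all assumptions made up for illustration. The snippet also builds the linear-kernel (Gram) matrix, whose entries are inner products between pairs of mapped points.

```r
# Toy 1-D data (hypothetical): the outer points are class +1, the inner ones -1,
# so no single threshold on x separates the two classes
x <- c(-2, -1, 1, 2)
y <- c(1, -1, -1, 1)

# Explicit feature map into 2-D: phi(x) = (x, x^2)
phi <- cbind(x, x^2)

# In the mapped space the line x^2 = 2.5 separates the classes
pred <- ifelse(phi[, 2] - 2.5 >= 0, 1, -1)
print(all(pred == y))

# Linear kernel on the mapped points: K[i, j] = phi(x_i)^T phi(x_j)
K <- phi %*% t(phi)
print(K)
```

In practice a kernel function returns these inner products directly, so the mapped coordinates never have to be computed explicitly.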

Data Analysis

Loading Libraries

Code
library(tidyverse)
library(dplyr)
library(ggplot2)
#install.packages("corrplot")
library(corrplot)

Load Data

Code
df <- read.csv("dataset/train.csv")
Code
head(df)
  id CustomerId        Surname CreditScore Geography Gender Age Tenure  Balance
1  0   15674932 Okwudilichukwu         668    France   Male  33      3      0.0
2  1   15749177  Okwudiliolisa         627    France   Male  33      1      0.0
3  2   15694510          Hsueh         678    France   Male  40     10      0.0
4  3   15741417            Kao         581    France   Male  34      2 148882.5
5  4   15766172      Chiemenam         716     Spain   Male  33      5      0.0
6  5   15771669       Genovese         588   Germany   Male  36      4 131778.6
  NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited
1             2         1              0       181449.97      0
2             2         1              1        49503.50      0
3             2         1              0       184866.69      0
4             1         1              1        84560.88      0
5             2         1              1        15068.83      0
6             1         1              0       136024.31      1

Summary Statistics

Code
summary(select(df, CreditScore, Age, Tenure, Balance, NumOfProducts, EstimatedSalary))
  CreditScore         Age            Tenure         Balance      
 Min.   :350.0   Min.   :18.00   Min.   : 0.00   Min.   :     0  
 1st Qu.:597.0   1st Qu.:32.00   1st Qu.: 3.00   1st Qu.:     0  
 Median :659.0   Median :37.00   Median : 5.00   Median :     0  
 Mean   :656.5   Mean   :38.13   Mean   : 5.02   Mean   : 55478  
 3rd Qu.:710.0   3rd Qu.:42.00   3rd Qu.: 7.00   3rd Qu.:119940  
 Max.   :850.0   Max.   :92.00   Max.   :10.00   Max.   :250898  
 NumOfProducts   EstimatedSalary    
 Min.   :1.000   Min.   :    11.58  
 1st Qu.:1.000   1st Qu.: 74637.57  
 Median :2.000   Median :117948.00  
 Mean   :1.554   Mean   :112574.82  
 3rd Qu.:2.000   3rd Qu.:155152.47  
 Max.   :4.000   Max.   :199992.48  

Credit Score:

  • The Credit Score ranges from a minimum of 350 to a maximum of 850.

  • The median Credit Score is 659, indicating that half of the customers have a score below 659 and half have a score above.

  • The mean Credit Score is approximately 656.5, suggesting that the average creditworthiness of customers is in the mid-range.

  • The 1st quartile (25th percentile) is 597, and the 3rd quartile (75th percentile) is 710, indicating that 50% of customers have a Credit Score between 597 and 710.

Age:

  • The Age of customers ranges from 18 to 92 years. The median age is 37 years, meaning half of the customers are younger than 37 and half are older.

  • The mean age is approximately 38.13 years, indicating that the average customer is in their late thirties.

  • The distribution of Age is slightly right-skewed, as the mean is slightly higher than the median.

Tenure:

  • Tenure, or the number of years customers have been with the bank, ranges from 0 to 10 years.

  • The median tenure is 5 years, indicating that half of the customers have been with the bank for less than 5 years and half for more.

  • The mean tenure is approximately 5.02 years, suggesting that the average customer has been with the bank for around 5 years.

Balance:

  • The account Balance ranges from a minimum of 0 to a maximum of 250,898.

  • The median balance is 0, indicating that at least half of the customers have no balance in their account.

  • The mean balance is approximately 55,478, suggesting that while many customers have low or zero balances, some have significant amounts in their accounts.

Number of Products:

  • The Number of Products customers have with the bank ranges from 1 to 4.

  • The median number of products is 2, meaning that half of the customers have 2 or fewer products with the bank.

  • The mean number of products is approximately 1.554, indicating that on average, customers have between 1 and 2 products with the bank.

Estimated Salary:

  • The Estimated Salary ranges from a minimum of 11.58 to a maximum of 199,992.48.

  • The median estimated salary is 117,948, suggesting that half of the customers have an estimated salary below this amount and half above.

  • The mean estimated salary is approximately 112,574.82, indicating that the average estimated salary of customers is around 112k.

Count Of Categorical value types

Code
sapply(df[,c('Geography', 'Gender', 'HasCrCard', 'IsActiveMember', 'Exited')], function(x) length(unique(x)))
     Geography         Gender      HasCrCard IsActiveMember         Exited 
             3              2              2              2              2 

Checking null values

Code
colSums(is.na(df))
             id      CustomerId         Surname     CreditScore       Geography 
              0               0               0               0               0 
         Gender             Age          Tenure         Balance   NumOfProducts 
              0               0               0               0               0 
      HasCrCard  IsActiveMember EstimatedSalary          Exited 
              0               0               0               0 

There are no null values in the data.

Distribution of target variable

Code
table(df$Exited)

     0      1 
130113  34921 

The number of customers who exited (34,921) is much smaller than the number who did not (130,113), so only about 21% of customers churned. This class imbalance needs to be addressed when building the model.
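One common way to handle such an imbalance is to weight the classes inversely to their frequency. The sketch below derives the class shares and a set of inverse-frequency weights from the counts printed above. The weighting formula is an illustrative choice, not necessarily the approach used later in this project.

```r
# Class counts copied from the table(df$Exited) output above
counts <- c(`0` = 130113, `1` = 34921)

# Share of each class in the data
props <- counts / sum(counts)
print(round(props, 4))

# Inverse-frequency class weights: the rarer class gets the larger weight
weights <- sum(counts) / (length(counts) * counts)
print(round(weights, 3))
```

Weights of this kind could be passed to an SVM implementation that accepts per-class weights, for example through the class.weights argument of e1071::svm, assuming that package is the one used for modelling.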

Distribution of target variable across Geography.

Code
table(df$Geography, df$Exited)
         
              0     1
  France  78643 15572
  Germany 21492 13114
  Spain   29978  6235

France:

  • A total of 94,215 customers are from France.
  • Out of these, 78,643 customers have not exited the bank (retained), while 15,572 customers have exited (churned).
  • The churn rate for France is approximately 16.53%.

Germany:

  • A total of 34,606 customers are from Germany.
  • Out of these, 21,492 customers have not exited the bank, while 13,114 customers have exited.
  • The churn rate for Germany is approximately 37.90%.

Spain:

  • A total of 36,213 customers are from Spain.
  • Out of these, 29,978 customers have not exited the bank, while 6,235 customers have exited.
  • The churn rate for Spain is approximately 17.22%.
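The churn rates quoted above can be recomputed from the contingency table. The sketch below rebuilds the table from the printed counts so that it runs on its own. With the data loaded, the same result should follow from prop.table(table(df$Geography, df$Exited), margin = 1).

```r
# Counts copied from the table(df$Geography, df$Exited) output above
tab <- matrix(
  c(78643, 15572,
    21492, 13114,
    29978,  6235),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("France", "Germany", "Spain"), c("0", "1"))
)

# Row-wise proportions: share of retained and churned customers per country
rates <- prop.table(tab, margin = 1)

# Churn rate per country, in percent
print(round(100 * rates[, "1"], 2))
```

Germany's churn rate is more than double that of France or Spain, which suggests Geography may be an informative feature for the model.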

Which Gender has the highest average Credit Score?

Code
aggregate(df$CreditScore, by = list(df$Gender), FUN = mean)
  Group.1        x
1  Female 656.2437
2    Male 656.6169

Observations:

  • The difference in average credit scores between male and female customers is minimal, indicating that gender does not significantly impact creditworthiness in this dataset.

  • Both genders have an average credit score in the mid-650s, which is considered a fair credit score range.

Distribution of Age.

Code
ggplot(df, aes(x = Age)) + geom_histogram(binwidth = 5, fill = "blue", color = "black")

Observations:

  • The largest concentration of customers falls within the 30 to 40-year-old range, indicating that the majority of customers are in their early to mid-career stages.

  • There is a significant drop in frequency as age increases, especially beyond 50 years. This suggests that the customer base skews younger.

  • The distribution is right-skewed, meaning there are fewer older customers (those over 60) compared to younger customers.

  • There is a small number of customers in the youngest age bracket (under 25 years) and the oldest (over 75 years).

Code
ggplot(df, aes(x = EstimatedSalary)) + geom_histogram(binwidth = 5, fill = "blue", color = "black")

Observations:

  • The distribution is quite uniform across different salary ranges, with no distinct peaks that would indicate a concentration of individuals around a specific salary bracket.

  • There are frequent spikes throughout the distribution, which may suggest that the data contains many unique values with small frequencies. This could be indicative of precise salary estimations rather than rounded figures.

  • The salaries range from very low values close to 0 up to 200,000, indicating a diverse group from potentially different economic backgrounds or job roles.

  • There is no obvious concentration of data points around the lower, middle, or upper salary range, which is unusual for income data where one typically expects to see more of a bell-shaped distribution centered around a median salary range.
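The spikes noted above may be partly an artifact of the bin width: binwidth = 5 on a variable spanning roughly 0 to 200,000 implies tens of thousands of bins, most of which hold very few observations. The sketch below illustrates this on synthetic data. The uniform sample is an assumption standing in for df$EstimatedSalary, and the 5,000 bin width is only a suggested alternative.

```r
set.seed(1)
# Synthetic stand-in for df$EstimatedSalary (assumption: roughly uniform on 0-200,000)
salary <- runif(10000, min = 0, max = 200000)

# Number of bins implied by a given bin width over the observed range
n_bins <- function(x, width) ceiling(diff(range(x)) / width)
print(n_bins(salary, 5))      # tens of thousands of bins, most nearly empty
print(n_bins(salary, 5000))   # a few dozen bins

# Counts per 5,000-wide bin; with the real data the same width can be passed to
# geom_histogram(binwidth = 5000)
breaks <- seq(0, 200000, by = 5000)
counts <- table(cut(salary, breaks = breaks, include.lowest = TRUE))
print(summary(as.vector(counts)))
```

A coarser bin width makes it easier to judge whether the salary distribution is genuinely flat or only looks noisy because of the binning.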

Comparing the distribution of account balances between customers who have exited and customers who have not exited.

Code
ggplot(df, aes(x = as.factor(Exited), y = Balance)) + geom_boxplot()

Observations:

  • Balance Distribution:

    • The y-axis represents the balance on customer accounts, which seems to range from 0 to a bit over 250,000.

    • Both boxes have a similar interquartile range (IQR), which is the range between the first quartile (25th percentile) and the third quartile (75th percentile), represented by the height of the boxes. This suggests that the middle 50% of balances are similarly distributed between both groups.

    • The median, indicated by the line within each box, is roughly at the same level for both groups, suggesting that the central tendency of balance is similar regardless of whether the customer has exited or not.

  • Outliers:

    • There are visible outliers for both groups, indicated by the points beyond the whiskers of the box plot. These outliers represent customers with balances significantly higher than the general population of the dataset.

How does the distribution of the number of products vary across different geographical regions?

Code
ggplot(df, aes(x = Geography, fill = as.factor(NumOfProducts))) + geom_bar(position = "dodge")

Observations:

  1. France:

    • France has the highest count of customers using one product, followed closely by those using two products. The number of customers using three and four products is significantly lower.
  2. Germany:

    • Germany shows a similar pattern to France with one and two products being the most common among customers. However, the count for one product is notably lower than in France, whereas the count for two products is slightly higher.
  3. Spain:

    • Spain’s pattern mirrors that of France and Germany, with one product being the most common, followed by two products. Again, three and four products are used by a considerably smaller number of customers.

Scatter plot of Age vs Estimated Salary, colored by churn status, to check which age groups and salary ranges have exited the bank.

Code
ggplot(df, aes(x = Age, y = EstimatedSalary, color = as.factor(Exited))) + geom_point()

Observations:

  1. There doesn’t appear to be a clear pattern or correlation between Age and Estimated Salary with customer churn, as the exited and non-exited customers are interspersed throughout the plot without any distinct clustering.

  2. Customers who have exited are spread across all ages and salary levels, but there seems to be a slightly higher concentration of churned customers in the 40 to 50 age range.

Scatter plot of Age vs Credit Score, colored by churn status, to check which age groups and Credit Score ranges have exited the bank.

Code
ggplot(df, aes(x = Age, y = CreditScore , color = as.factor(Exited))) + geom_point()

Observations:

  1. There is a wide distribution of Credit Scores across different ages with no clear pattern indicating that Credit Score by itself may not be a strong predictor of customer exit.

  2. Both exited and non-exited customers are found across the entire range of Credit Scores and Age, but there is a noticeable density of exited customers (blue dots) in the middle age range, particularly between ages 40 and 50.

Scatter plot of Estimated Salary vs Credit Score, colored by churn status, to check which Estimated Salary and Credit Score ranges have exited the bank.

Code
ggplot(df, aes(x = EstimatedSalary, y = CreditScore , color = as.factor(Exited))) + geom_point()

Observations:

The scatter plot shows no clear correlation between Credit Score and Estimated Salary in predicting customer churn, with both customers who exited and those who did not evenly dispersed across all ranges of Salary and Credit Scores.

Correlation Plot

Code
corr_matrix <- cor(select(df, CreditScore, Age, Tenure, Balance, NumOfProducts, EstimatedSalary))
corrplot(corr_matrix, method = "circle")

Observations:

There seems to be a noticeable positive correlation between Age and Balance, and a negative correlation between NumOfProducts and Balance.

References

  1. Jakkula, V. (2006). Tutorial on support vector machine (svm). School of EECS, Washington State University, 37(2.5), 3.

  2. Kecman, V. (2005). Support vector machines-an introduction. In Support vector machines theory and applications (pp. 1-47). Berlin, Heidelberg: Springer Berlin Heidelberg.

  3. Yue, S., Li, P., & Hao, P. (2003). SVM classification: Its contents and challenges. Applied Mathematics-A Journal of Chinese Universities, 18, 332-342.

  4. Jun, Z. (2021). The development and application of support vector machine. In Journal of Physics: Conference Series (Vol. 1748, No. 5, p. 052006). IOP Publishing.

  5. Bhavsar, H., & Panchal, M. H. (2012). A review on support vector machine for data classification. International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), 1(10), 185-189.

  6. Deris, A. M., Zain, A. M., & Sallehuddin, R. (2011). Overview of support vector machine in modeling machining performances. Procedia Engineering, 24, 308-312.

  7. Han, Shuo. “Using SVM with Financial Statement Analysis for Prediction of Stocks.” Communications of the IIMA, vol. 7, 2007, scholarworks.lib.csusb.edu/cgi/viewcontent.cgi?article=1059&context=ciima.

  8. Ahmadi, Muhammad Iqbal, et al. “SENTIMENT ANALYSIS ONLINE SHOP on the PLAY STORE USING METHOD SUPPORT VECTOR MACHINE (SVM).” Seminar Nasional Informatika (SEMNASIF), vol. 1, no. 1, 15 Dec. 2020, pp. 196–203, jurnal.upnyk.ac.id/index.php/semnasif/article/view/4101. Accessed 13 Feb. 2024.

  9. Razzaghi, Talayeh, et al. “Multilevel Weighted Support Vector Machine for Classification on Healthcare Data with Missing Values.” PLOS ONE, vol. 11, no. 5, 19 May 2016, p. e0155119, https://doi.org/10.1371/journal.pone.0155119.

  10. Öz, Ersoy, and Hüseyin Kaya. “Support Vector Machines for Quality Control of DNA Sequencing.” Journal of Inequalities and Applications, vol. 2013, no. 1, 4 Mar. 2013, https://doi.org/10.1186/1029-242x-2013-85. Accessed 15 June 2021.

  11. “Support Vector Machine for Network Intrusion and Cyber-Attack Detection | IEEE Conference Publication | IEEE Xplore.” Ieeexplore.ieee.org, ieeexplore.ieee.org/abstract/document/8233268. Accessed 13 Feb 2024.

  12. Kumar, Sachin, et al. “Precision Sugarcane Monitoring Using SVM Classifier.” Procedia Computer Science, vol. 122, 2017, pp. 881–887, https://doi.org/10.1016/j.procs.2017.11.450. Accessed 25 July 2019.

  13. Javeed, A. et al. (2023) Early prediction of dementia using feature Extraction Battery (FEB) and optimized support vector machine (SVM) for Classification, MDPI. Available at: https://www.mdpi.com/2227-9059/11/2/439 (Accessed: 22 January 2024).

  14. Nawal, Y., Oussalah, M., Fergani, B., & Fleury, A. (2022). New incremental SVM algorithms for human activity recognition in smart homes. Journal of Ambient Intelligence and Humanized Computing. https://doi.org/10.1007/s12652-022-03798-w

  15. Zhang, L., Hu, H., & Zhang, D. (2015). A credit risk assessment model based on SVM for small and medium enterprises in supply chain finance. Financial Innovation, 1(1). https://doi.org/10.1186/s40854-015-0014-5

  16. Harimoorthy, K., Thangavelu, M. RETRACTED ARTICLE: Multi-disease prediction model using improved SVM-radial bias technique in healthcare monitoring system. J Ambient Intell Human Comput 12, 3715–3723 (2021). https://doi.org/10.1007/s12652-019-01652-0

  17. J. Liang, Z. Qin, L. Xue, X. Lin and X. Shen, “Verifiable and Secure SVM Classification for Cloud-Based Health Monitoring Services,” in IEEE Internet of Things Journal, vol. 8, no. 23, pp. 17029-17042, 1 Dec. 2021, doi: 10.1109/JIOT.2021.3075540.

  18. G. N. Ahmad, H. Fatima, S. Ullah, A. Salah Saidi and Imdadullah, “Efficient Medical Diagnosis of Human Heart Diseases Using Machine Learning Techniques With and Without GridSearchCV,” in IEEE Access, vol. 10, pp. 80151-80173, 2022, doi: 10.1109/ACCESS.2022.3165792.

  19. Sahara, S., Annida Purnamawati, Sulaeman Hadi Sukmana, Mely Mailasari, Erma Delima Sikumbang, & Puji, E. (2023). PSO optimization for analysis of online marketplace products on the SVM method. AIP Conference Proceedings. https://doi.org/10.1063/5.0129404

  20. “Prediction of Consumer Purchasing in a Grocery Store Using Machine Learning Techniques.” Ieeexplore.ieee.org, ieeexplore.ieee.org/document/7941935.

  21. Barakat, Nahla, et al. “Intelligible Support Vector Machines for Diagnosis of Diabetes Mellitus.” IEEE Transactions on Information Technology in Biomedicine, vol. 14, no. 4, July 2010, pp. 1114–1120, https://doi.org/10.1109/titb.2009.2039485.

  22. “Applying Support Vector Machine to Electronic Health Records for Cancer Classification | IEEE Conference Publication | IEEE Xplore.” Ieeexplore.ieee.org, ieeexplore.ieee.org/abstract/document/8732906.

  23. “An Effective Intrusion Detection Approach Using SVM with Naïve Bayes Feature Embedding.” Computers & Security, vol. 103, 1 Apr. 2021, p. 102158, www.sciencedirect.com/science/article/pii/S0167404820304314, https://doi.org/10.1016/j.cose.2020.102158.

  24. Hosseini, Soodeh, and Behnam Mohammad Hasani Zade. “New Hybrid Method for Attack Detection Using Combination of Evolutionary Algorithms, SVM, and ANN.” Computer Networks, vol. 173, May 2020, p. 107168, https://doi.org/10.1016/j.comnet.2020.107168.